ABSTRACT

Recent trends in writing skill assessment suggest a 
movement toward the use of free-response writing tasks and away from 
the traditional multiple-choice test. A number of national 
examinations, including major college admissions tests, have included 
free-response components. Most of the arguments in support of this 
trend relate to the hypothesized effects of testing on curriculum and 
instruction, but others center around systemic validity and 
authenticity. There are questions in these areas, however, beginning 
with the question of what the content of a writing assessment should 
be. The reliability of free-response writing tests is often reported 
in terms of interrater reliability, but correlations of scores 
assigned by different raters can inflate the estimate of reliability. 
Combining assessment types, essay and multiple choice, is a way to 
improve reliability that is proving workable. The predictive 
effectiveness of writing skill assessments is related to reliability. 
Issues of fairness, comparability, cognitive complexity, and cost and 
efficiency must be addressed in the construction of free-response 
writing skill assessments. Technology seems to be an important key to 
the future of writing skill assessment. The future seems to be one of 
increasing acceptance of performance tasks, and these will be best 
administered through the computer. (Contains 1 figure and 51 
references.) (SLD) 

Preface 



Recent trends in the assessment of writing signal a 
turn away from traditional multiple-choice tests and 
toward the assessment of actual writing performance. 
Hunter Breland draws on the experience of a wide range of 
programs in which writing is assessed to provide a compre- 
hensive review of assessment practices, new and old. He 
describes the arguments for more authentic writing assess- 
ment as well as the important issues of validity, reliability, 
comparability, and fairness that must be considered. His 
long experience in research and development in the writing 
assessment areas uniquely qualifies him to extract from 
this experience what is useful to a nontechnical audience. 

This report is in the Policy Issue Perspective series 
published by the Center. In this series, research and expe- 
rience are combined to present both knowledge and profes- 
sional judgment. 

The manuscript was reviewed by Brent Bridgeman 
and Claudia Gentile at ETS. Gwen Shrift edited it and 
Carla Cooper provided desktop publishing services. 



Paul E. Barton 
Director 

Policy Information Center 

Introduction 



Recent trends in writing skill assessment suggest a 
distinctive movement toward the use of free-response writ- 
ing tasks and away from traditional multiple-choice assess- 
ments. These trends relate to a more general movement 
toward performance-based assessment. Testing programs 
that have added free-response essay assessments recently 
include a number of statewide assessments, the National 
Assessment of Educational Progress (NAEP), the Medical 
College Admission Test (MCAT), and the Graduate Man- 
agement Admission Test (GMAT). The Graduate Record 
Examination (GRE) program is planning to add at least one 
free-response essay. The Law School Admission Test 
(LSAT) introduced a free-response writing task early in 
1982. Despite these trends in some programs, however, it 
is also clear that other testing programs are not joining 
this trend or are doing so only in moderation. American 
College Testing (ACT) English tests used in college admis- 
sions are still multiple-choice, although some smaller ACT 
testing programs use writing samples. The Scholastic As- 
sessment Test (SAT) recently introduced an optional writ- 
ing skill assessment that includes a combination of essay 
and multiple-choice questions. The Tests of General Educa- 
tional Development (GED) Writing Skills Test (WST), the 
Advanced Placement English Language and Composition 
examination, and the Praxis tests for teacher certification 
all use combinations of essay and multiple-choice questions. 
Further details on these testing programs illustrate the 
approaches being used. 

Statewide Assessment Programs. According to the June 
1995 annual report, The Status of State Student Assessment 
Programs in the United States, and its associated database, 
published by the Council of Chief State School Officers 
(CCSSO, 1995a, 1995b) and the North Central Regional 
Educational Laboratory (NCREL), 47 states have assess- 
ment programs. The number of states using performance- 
based assessments has grown from 17 in 1991-92, to 23 in 
1992-93, to 25 in 1993-94. In some states, there is a clear trend 
toward writing samples, criterion-referenced testing, and alternative 
assessments, and away from norm-referenced multiple-choice 
assessments. Writing samples are 
used in 38 states. The number of states using writing port- 
folios remained constant at seven over this same period, 
however. Seventeen states use combinations of multiple- 
choice and performance tasks, while seven states use only 
multiple-choice assessments and two states use only alternative 
assessments coupled with writing samples. States 
report that the major purposes of their assessments are 
improving instruction (43 states), school performance report- 
ing (41 states), program evaluation (37 states), and student 
diagnosis (26 states). Only 17 states use their assessments 
for high school graduation, only 12 use them for school ac- 
creditation, and only two for teacher evaluation. Students are 
assessed most often in grades 4, 8, and 11. A special problem 
in statewide assessment is the requirement, often by law, 
that the same assessment be used for both accountability and 
instructional purposes. 



The National Assessment of Educational Progress 
(NAEP). The NAEP writing measure has always been a 
direct measure of writing, and it did not change much from 
the 1970s up until 1992. In 1992, a new framework was 
developed that increased the administration time from 15 
minutes to 25 minutes and included 50-minute prompts at 
grades 4 and 8. Additionally, a planning page was included 
after each prompt, and the scoring rubrics were increased 
from 4 to 6 levels. Finally, a writing portfolio was introduced 
in 1992 to provide an in-depth look at classroom writing 
(NAEP, 1994a, 1994b). It is important to note that the NAEP 
testing program is intended to produce aggregate data for 
national or state samples of students, and thus does not 
encounter the same kinds of problems as testing programs 
aimed at individual assessment. 



Medical College Admission Test (MCAT). The MCAT 
introduced a writing skill assessment consisting of two 
30-minute essays in 1991. The essay topics present a brief 
quotation; the examinee is asked to explain the meaning of 
the quotation and then answer specific questions about it. 
Since the MCAT is an all-day battery of tests with about six 
hours of actual testing time, the new writing skill assessment 
represents only about one-sixth of total testing time. 

Graduate Management Admission Test (GMAT). The 
GMAT writing assessment was introduced in 1994. Similar 
to the MCAT writing assessment, the GMAT Analytical 
Writing Assessment (AWA) consists of two 30-minute writing 
tasks. One of the 30-minute writing tasks is called “Analysis 
of an Issue” and the other “Analysis of an Argument.” The 
AWA is designed as a direct measure of an examinee’s ability 
to think critically and communicate ideas. The issues and 
arguments are often presented as quotations, as in the 
MCAT, but the quotations are longer. The responses are 
scored holistically, and copies of responses are included in 
admissions materials. The writing assessment represents 
about one-fourth of total GMAT testing time. The GMAT 
also includes, as part of the GMAT Verbal Reasoning mea- 
sure, a 25-minute sentence correction test in multiple- 
choice format. 

Graduate Record Examination (GRE). The GRE plans 
to introduce at least one 45-minute essay in 1999. An addi- 
tional 30-minute essay will be included if field tests now 
underway support it. The writing measure will represent 
somewhere between one-third and one-half of total testing 
time, depending on the outcomes of the field tests. 

Law School Admission Test (LSAT). The LSAT intro- 
duced a 30-minute writing sample in 1982. Rather than 
being scored, however, the LSAT writing sample is repro- 
duced and included with admissions materials for each law 
school applicant. 

American College Testing (ACT) English Test. The ACT 
English test is a 45-minute multiple-choice test with 75 
questions. The test assesses understanding of the conven- 
tions of grammar, sentence structure, and punctuation, as 
well as strategy, organization, and style. Five passages are 
presented, and each passage has several questions associ- 
ated with it. Some questions refer to the passage as a 
whole, while other questions are about underlined words or 
phrases. Three scores are reported: a total score, a subscore 
on usage and mechanics, and a subscore on rhetorical 
skills. No free-response writing is required. 

Scholastic Assessment Test (SAT). In 1994, the SAT 
was revised to include two parts: SAT I Reasoning and SAT 
II Achievement. The SAT II assessment is separate from 
SAT I and may or may not be required by institutions to 
which students are seeking admission. SAT II Writing, 
which is administered five times per year, consists of a 
20-minute essay and a 40-minute multiple-choice test based 
on sentences and brief passages. A total score is reported, 
as are scores for both the essay and the multiple-choice 
tests. 

Tests of General Educational Development (GED). The 
GED tests are used to grant a high school diploma to adults 
who did not complete high school. The Writing Skills Test 
(WST) of the GED consists of 50 multiple-choice questions 
and a single essay. Examinees have two hours to complete 
the WST, and they are advised to use 75 minutes for the 
multiple-choice questions and 45 minutes for the essay. The 
essay is scored by two trained readers. The essay and multiple-choice 
sections are weighted (.36 and .64, respectively) 
and then scaled to form a single composite score. 

Advanced Placement English Language and Composition 
(AP/EL&C). The AP/EL&C examination consists of 60 mul- 
tiple-choice questions and three essays. Each of the three 
essays is read and scored by a different reader, and the total 
essay score is weighted 60 percent (versus 40 percent for the 
multiple-choice portion) in a composite grade reported on a 
1-5 scale. 

Praxis. The Praxis teacher certification tests, initiated in 
1992 as a successor to the National Teacher Examinations 
(NTE), include a writing test with 45 multiple-choice ques- 
tions and one 30-minute essay. The multiple-choice writing 
test can be taken in either a paper-and-pencil mode or as a 
computer-based test. If the computer-based multiple-choice 
test is taken, the essay may be written either with paper and 
pencil or with a word processor. The multiple-choice ques- 
tions test understanding of subject-verb agreement, noun- 
pronoun agreement, correct verb tense, parallelism, clarity, 
and other conventions of standard written English. The 
prompts for the essay pose questions of relevance to teachers. 



Arguments for Free-Response Writing Tasks 

The arguments for the use of free-response writing tasks 
in the assessment of writing skill are essentially those of the 
performance testing movement, in which writing is often a 
focus. Most of the arguments for performance-based testing 
relate to hypothesized effects of testing on curriculum and 
instruction, but there are other arguments as well. The 
various arguments often overlap and, at times, seem to be 
the same argument using different terminologies. 

Decomposition / decontextualization. One of the more 
elaborate arguments is that of Resnick & Resnick (1990). 
The argument begins by stating that two key assumptions, 
decomposability and decontextualization, underlie traditional 
standardized testing. The assumption of decomposability is 
that thought can be fractionated into independent pieces of 
knowledge, as in multiple-choice tests of writing skill when 
brief, independent problems in sentences are posed rather 
than a requirement for actual composition. The decontextua- 
lization assumption is that competence can be assessed “in a 
context very different from that in which it is practiced and 



Systemically valid tests 
U induce curricular and 
instructional changes 
in education systems ” 
and (( foster the develop- 
ment of the cognitive 
traits that the tests are 
designed to measure” 



Authentic tasks are seen 
by Wiggins as those that 
are non-routine and 
multistage , that require 
the student to produce a 
high-quality product , 
and that are transpar- 
ent in the sense that the 
student knows what to 
expect and can prepare 
for them. 



used,” [p. 71] as in standardized testing in writing for 
which examinees are asked to edit the writing of someone 
else. Essay assessments in writing for which judges 
evaluate performances are given as an example of performance 
assessments, which are seen as a means for releasing 
educators from “the pressure toward fractionated, 
low-level forms of learning rewarded by most current 
tests . . .” [p. 78]. 

Systemic Validity. Systemically valid tests “induce 
curricular and instructional changes in education systems” 
and “foster the development of the cognitive traits that the 
tests are designed to measure” (Frederiksen & Collins, 
1989). Such tests are described as being direct (as opposed 
to indirect) and require subjective judgment in the assign- 
ment of scores. As in the decomposition/decontextualization 
argument, tests that emphasize isolated skill components, 
rather than higher-level processes, are seen as having a 
negative impact on instruction and learning. Free-response 
essay tests of writing skill are cited as examples of systemi- 
cally valid tests. 

Authenticity. Wiggins (1989, 1993) has argued that 
traditional standardized testing is not “authentic.” 
This is another way of saying that tests 
are often not representative of real-life tasks. A multiple- 
choice test of verbal analogies, for example, is easily shown 
to involve tasks that are not encountered in everyday life 
in either school or work. However, Wiggins (1993) also 
observes that very brief essay tests are not authentic 
because in real life one has more time to write: 

“Thus whatever assessors are testing in a 
20-minute essay, it is certainly not the ability 
to write. As those of us who write for a living 
know, writing is revision, a constant returning 
to the basic questions of audience and 
purpose . . .” [p. 208]. 

Authentic tasks are seen by Wiggins as those that are 
nonroutine and multistage, that require the student to 
produce a high-quality product, and that are transparent in 
the sense that the student knows what to expect and can 
prepare for them. Dwyer (1993) observed considerable 
confusion about just what “authentic assessment” is per- 
ceived to be, however. 



Teaching to the Test. Archbald & Porter (1990) described 
two main lines of argument advanced in criticisms of educa- 
tional testing. The first is that testing adversely affects cur- 
riculum and instruction: 



“Mandated student testing is conducted almost 
exclusively using facts and skills-dominated 
multiple-choice tests. Because there is account- 
ability pressure for schools to achieve high test 
scores . . . teachers are forced to ‘teach to the tests’ 
— that is to shape their curriculum and instruc- 
tion around the goal of developing students’ test- 
taking abilities” [p. 34]. 

This argument is continued by noting that what the tests do 
not measure does not get taught. Creativity, depth of under- 
standing, integration of knowledge, ill-structured problem 
solving, and communication, for example, are not often in- 
cluded in tests and thus do not get taught. Another way of 
referring to this argument is to say that “the assessment tail 
wags the curriculum dog” (Swanson, Norman, and Linn, 
1995). 



Teacher Professionalism. A second argument against 
traditional testing given by Archbald and Porter (1990) is 
that it erodes teacher professionalism. When tests are used 
to make judgments about teacher or school quality, as well as 
promotion or retention of students, they exert a strong influ- 
ence on what is taught and undermine teachers’ pedagogical 
autonomy and feelings of professional worth. Similarly, 
White (1994), in the context of writing skill assessment, 
argues that holistic scoring of writing samples (as contrasted 
with multiple-choice tests) requires an “interpretive community” 
of teachers of writing “whose work is made meaningful 
by a joint social purpose” [p. 281]. 

Cognitive Science. A sophisticated line of argument 
supporting performance testing comes from the growing 
fields of cognitive and instructional psychology and from the 
testing establishment itself, which in recent years has been 
more influenced by these academic disciplines. Part of this 
support comes from increasing interest in diagnosis and 
feedback. Nichols (1994) describes a new type of assessment 
termed “cognitively diagnostic assessment” (CDA) that 
requires analysis of “processes and knowledge structures 
involved in performing everyday tasks.” Although CDA is 
responsive to some of the same educational concerns as 
performance-based testing, Nichols does not support all 
performance-based testing: 



“. . . scores on new performance-based or 
authentic assessments often provide little more 
information than traditional assessments to 
guide specific instructional decisions. Performance-based 
or authentic assessments may 
well consist of tasks that are more representa- 
tive of some intended domain; however, these 
assessments continue to be developed and 
evaluated with an eye toward the same crite- 
rion — estimating a persons location on an 
underlying latent continuum. In either case, 
scores indicate no more than the need for 
additional instruction” [p. 578]. 



In writing assessment, for example, free-response essay 
tests, when scored holistically, do not provide sufficient 
diagnostic feedback to inform instruction. CDA models 
focus on patterns of responses rather than average or total 
scores. The focus on patterns of responses is also reflected 
in arguments advanced by Mislevy (1993) for a new para- 
digm for assessment and by those interested in the psychol- 
ogy of problem solving (e.g., Snow & Lohman, 1989). 



Cautions from the Measurement Community 

As performance assessments have begun to be widely 
implemented in statewide assessments and in national 
admissions testing, a number of articles have appeared in 
educational measurement journals posing questions about 
the standards of quality that performance tests should 
satisfy and about validity, reliability, comparability, fair- 
ness, and other measurement issues (e.g., Dunbar, Koretz, 
& Hoover, 1991; Linn, Baker, & Dunbar, 1991; Linn, 1994; 
Linn & Burton, 1994; Mehrens, 1992; Messick, 1994a, 
1994b, 1995; Brennan & Johnson, 1995; 
Green, 1995; Bond, 1995). The following sections discuss 
these and other measurement concerns with a focus on 
writing skill assessment. Note that assessments of content 
knowledge, through the use of writing, are excluded from 
this discussion. 

Content. What should be the content of a writing skill 
assessment? Ideally, the content of such an assessment 
might be based on models of writing skill developed from 
protocol analyses in which subjects are observed and asked 
to think aloud about what they are doing as they respond 
to a writing assignment. One of the most extensive writing 
model developments has been conducted by Hayes & 
Flower (1980), Flower & Hayes (1981), and Hayes (1996) 
after many years of protocol analysis. The latest version of 
this model is shown in Figure 1. The central components of 
this model are the “cognitive writing processes,” viz., plan- 
ning, text generation, and revision. Other models of writing 
skill are in general agreement with the Hayes and Flower 
model. For example, Collins and Gentner (1980) separate 
writing into idea production and text production. Idea pro- 
duction occurs by keeping a journal of interesting ideas, 
obtaining ideas from others, reading books and other source 
materials, and attempting to explain one’s ideas to someone 
else. Text production includes initial drafting, revision, and 
editing. 

Figure 1. The Hayes and Flower Model 

Because writing assessments are usually constrained 
by a number of factors including the time available, security 
considerations, cost and reporting requirements, they rarely 
completely cover the domain of skills indicated by the writing 
models. Giving examinees advance notice of the topic to be 
written on (so that they will have time to generate ideas) 
cannot usually be allowed because that would create a test 
security problem. Quite often, a choice is made between 
drafting and revision because of time constraints. NAEP, 
MCAT, GMAT, LSAT, and some statewide assessments 
have opted for drafting only, with essays, letters, or some 
other free response as the vehicle. The planned GRE writ- 
ing assessment also intends to assess drafting skills. Most 
often the free-response tasks assigned must be produced in 
a relatively brief period of time, although NAEP and sev- 
eral statewide assessments use writing portfolios in addi- 
tion to brief assignments. 



When a balance of assessment types is used, the 
amount of time available for each type becomes even more 
limited. In the SAT II writing assessment, for example, 
only 20 minutes of the one-hour test is allowed for the 
essay. If the entire test were free-response only, there 
would be no time for revision and editing tasks. That is 
the choice that has been made for the NAEP, MCAT, and 
GMAT writing tests, and planned for the GRE writing 
measure. This dilemma is partly resolved by recognizing 
that, “no matter how realistic a performance-based assess- 
ment is, it is still a simulation, and examinees do not 
behave in the same way they would in real life” (Swanson, 
Norman, and Linn, 1995). Nevertheless, the GMAT writing 
measure introduced in 1994 is already being criticized in 
the management literature for requiring only drafting, 
among other criticisms (Rogers & Rymer, 1995a, 1995b). 

Reliability. The reliability of free-response writing 
examinations is often reported in terms of inter-rater 
reliability. This can be simply the correlation of scores 
assigned by two different raters to the same set of free 
responses. Unfortunately, such correlations always inflate 
the estimate of reliability, because only the error intro- 
duced by the raters is included. A more important source of 
error in free-response writing skill assessments is that due 
to the sampling of tasks for the writing assignments. What 
this means is that error is introduced because the examinee 
may be allowed to write on only a single topic chosen by the 
examiner. This topic may be an easy one for some examin- 
ees, because they may have recently written or thought 
about it, but a difficult topic for other examinees because 
they may never have thought about it. Accordingly, while 
the inter-rater correlation may be as high as .90, the score 
reliability for the same assessment may be only .50 because 
of the error introduced by topic sampling. See Reckase 
(1995b) for further explanation of this problem. Dunbar, 
Koretz, and Hoover (1991) reviewed reliability studies for 
common types of assessments and showed that essay tests of 
the type most often used (one topic, two readers) have score 
reliabilities of around .50, on average. It is often more useful 
to examine the standard error of measurement, derived from 
reliability, as recommended by Linn (1994) and Linn & Bur- 
ton (1994). The standard error of measurement is especially 
important when using assessments to make classification 
decisions such as pass or fail. 
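
The gap between inter-rater agreement and score reliability, and the standard error of measurement that follows from it, can be illustrated with a short Python sketch. The three variance components below are hypothetical values chosen only so that the results resemble the .90 and .50 figures mentioned above. They are not estimates from any of the cited studies, and the model (examinee, examinee-by-topic, and residual variance only) is a deliberate simplification.

```python
import math

# Hypothetical variance components, assumed for illustration only.
VAR_PERSON = 1.0        # true differences in writing skill among examinees
VAR_PERSON_TOPIC = 0.9  # examinee-by-topic interaction (topic sampling error)
VAR_RESIDUAL = 0.2      # reader disagreement and other residual error, per reading


def interrater_correlation():
    """Correlation between two single readings of the same essay.

    Both readers see the same topic, so the topic effect counts as
    shared variance rather than as error.
    """
    shared = VAR_PERSON + VAR_PERSON_TOPIC
    return shared / (shared + VAR_RESIDUAL)


def score_reliability(n_readers=2):
    """Reliability of the averaged score on one topic.

    Here topic sampling is counted as error, which is what matters
    when the score is meant to generalize beyond the single topic.
    """
    error = VAR_PERSON_TOPIC + VAR_RESIDUAL / n_readers
    return VAR_PERSON / (VAR_PERSON + error)


def standard_error_of_measurement(score_sd, reliability):
    """Classical formula: SEM = SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


print(round(interrater_correlation(), 2))   # about 0.90
print(round(score_reliability(), 2))        # 0.50
# With an assumed score standard deviation of 1.0:
print(round(standard_error_of_measurement(1.0, score_reliability()), 2))  # about 0.71
```

Under these assumed values the two readers agree closely, yet half of the score variance is error, because both readers respond to the same topic and therefore share its effect.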

Better reliabilities can be obtained by using more tasks 
and more raters. The MCAT and GMAT writing assessments, 
with two tasks of about 30 minutes each, and each task 
scored independently by two different raters, will produce 
much better reliabilities than the usual free-response assess- 
ment. Breland et al. (1987) estimated that essay assessments 
of this type could yield score reliabilities in excess of .70. 
Nevertheless, Linn (1994) suggests that reliabilities even as 
high as .80 can be problematical. 

The only way that the reliability problem of free- 
response tasks can be resolved is through the use of multiple 
samples written by the same examinee, with each scored 
independently by multiple raters. Reckase (1995a) shows 
that to approximate a .80 reliability, a writing assessment 
of five different samples is required, and the components of 
the assessment need to be similar rather than disparate. 
Breland et al. (1987) estimated that, for a hypothetical port- 
folio of six essays, each scored by three different raters, a 
score reliability of .88 could be attained. These high 
reliabilities are obtained only when all examinees write on 
the same topics under the same conditions and when the 
scoring is conducted by the same raters. That is, the tasks 
and administrative conditions (timing, scoring) are standard- 
ized. Writing portfolios, for which examinees submit writing 
samples of their own choice written on widely different topics 
and under varying conditions, are not likely to attain such 
high levels of reliability. Some statewide assessments using 
writing portfolios, notably one initiated in Vermont in 1988, 
have encountered serious problems with reliability (Koretz, 
Stecher, Klein, & McCaffrey, 1994). Better reliabilities were 
obtained with the NAEP writing portfolio, however (Gentile, 
1992). 
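
The effect of adding topics and readers can be sketched in Python with a simplified variance-components model. The function name and the default variance components are assumptions made for illustration, and the model presumes standardized conditions (all examinees writing on the same topics). The printed values are therefore not the estimates reported by Reckase (1995a) or Breland et al. (1987); they only show the direction and rough size of the effect.

```python
def projected_reliability(n_topics, n_readers,
                          var_person=1.0,
                          var_person_topic=0.9,
                          var_residual=0.2):
    """Reliability of an examinee's mean score over several essays.

    Hypothetical model: topic-sampling error shrinks with the number
    of topics, and residual (reader-related) error shrinks with the
    total number of readings, n_topics * n_readers.
    """
    error = (var_person_topic / n_topics
             + var_residual / (n_topics * n_readers))
    return var_person / (var_person + error)


# (topics, readers per essay) designs similar to those discussed in the text
for n_topics, n_readers in [(1, 1), (1, 2), (2, 2), (5, 2), (6, 3)]:
    print(n_topics, n_readers,
          round(projected_reliability(n_topics, n_readers), 2))
```

With these assumed components, adding readers to a single essay cannot raise reliability much above .52, because the topic-sampling error is untouched; only additional writing samples reduce it.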

The reliability of writing assessments can be increased 
by combining assessment types (essay and multiple-choice) as 
is done for the SAT II, GED, Advanced Placement, and Praxis 
writing assessments. The GED writing assessment, with a 
single 45-minute essay and 50 multiple-choice questions, 
yields a reliability of about .87 (Patience & Swartz, 1987; 
Lukhele & Sireci, 1995; Wiley & Sireci, 1994). The GED 
essay is scored by two trained readers on a six-point scale. 
The Advanced Placement English Language and Composi- 
tion examination, with three free-response tasks each 
scored by a single reader, and a 100-item multiple-choice 
test, produces reliabilities in the .78 to .90 range (College 
Board, 1988). 
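
Why combining an essay with a multiple-choice section raises reliability can be sketched in Python using the standard formula for the reliability of a weighted composite. The sketch assumes standardized section scores and uncorrelated measurement errors. The weights are the GED weights given earlier; the section reliabilities and the correlation between sections are hypothetical inputs, so the result should not be read as a reproduction of any published figure.

```python
def composite_reliability(weights, reliabilities, correlation):
    """Reliability of a weighted sum of two standardized section scores.

    Assumes each section has unit variance and that measurement
    errors in the two sections are uncorrelated.
    """
    w1, w2 = weights
    r1, r2 = reliabilities
    composite_var = w1 ** 2 + w2 ** 2 + 2 * w1 * w2 * correlation
    error_var = w1 ** 2 * (1 - r1) + w2 ** 2 * (1 - r2)
    return 1 - error_var / composite_var


# Essay weighted .36, multiple-choice weighted .64 (GED weights from the text).
# Reliabilities of .50 and .90 and a correlation of .60 are assumed values.
print(round(composite_reliability((0.36, 0.64), (0.50, 0.90), 0.60), 2))
```

The composite is more reliable than the essay alone because the more reliable multiple-choice section carries most of the weight and the two sections measure overlapping skills.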



As a final note on reliability of writing assessments, 
it is important to point out that the NAEP assessments, 
as aggregations of data intended for the assessment of 
national trends, do not have the same reliability problems 
as do assessments intended to produce scores for individual 
examinees. Similarly, statewide assessments intended for 
decision making at the school or district level can resolve 
reliability problems by pooling data for an entire school or 
district as well as across years, as is done for the Kentucky 
Instructional Results Information System (KIRIS) analyzed 
by Haertel (1994). 



Predictive Effectiveness. The predictive effectiveness 
of writing skill assessments, important for admissions 
tests, is related to reliability. High reliability is a necessary 
but not a sufficient condition for predictive effectiveness: 
even a highly reliable test will not predict well if it is 
assessing the wrong skills. To assess predictive effective- 
ness, test scores are correlated with some later outcome, 
such as grades in English courses or freshman grade point 
average (GPA). Bridgeman (1991), for example, analyzed 
the effectiveness of multiple-choice and essay assessments 
of writing skill for predicting college freshman GPA and 
obtained median correlations across 21 colleges of .30 for 
the multiple-choice assessment and .16 for the essay. The 
essay assessment did not add incrementally to the predic- 
tive effectiveness possible using multiple-choice tests alone. 
When English composition course grades, rather than GPA, 
have been used as the criterion, however, incremental 
predictive validity for essays has been observed (Breland 
and Gaynor, 1979). 
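
The link between reliability and predictive effectiveness can be made concrete with two classical test theory relations, sketched here in Python. The function names are illustrative, and the reliabilities used in the example are assumed values, not figures taken from Bridgeman (1991).

```python
import math


def max_observed_correlation(test_reliability, criterion_reliability=1.0):
    """Upper bound on an observed test-criterion correlation.

    Under classical test theory the observed correlation cannot exceed
    the square root of the product of the two reliabilities.
    """
    return math.sqrt(test_reliability * criterion_reliability)


def disattenuated_correlation(observed_r, test_reliability,
                              criterion_reliability=1.0):
    """Correction for attenuation: estimated correlation of true scores."""
    return observed_r / math.sqrt(test_reliability * criterion_reliability)


# An essay score with an assumed reliability of .50 could correlate at most
# about .71 with a perfectly reliable criterion.
print(round(max_observed_correlation(0.50), 2))

# Observed correlations of .16 (essay) and .30 (multiple-choice) from the
# text, corrected with assumed reliabilities of .50 and .90.
print(round(disattenuated_correlation(0.16, 0.50), 2))
print(round(disattenuated_correlation(0.30, 0.90), 2))
```

With these assumed inputs, correcting for the essay's lower reliability narrows the gap between the two correlations but does not close it, which is consistent with the point that reliability is necessary but not sufficient for prediction.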



Much higher predictive correlations can be obtained if, 
instead of grades, scores on performance tests of writing are 
predicted. Breland et al. (1987) obtained high correlations 
for predicting writing performance. A writing performance 
assessment consisting of five essays, each scored by three 
different readers, correlated well with a number of predictors. 
A 30-minute multiple-choice test of grammar and sentence 
structure questions correlated .70 with the writing 
performance assessment. When the same 30-minute mul- 
tiple-choice test was combined with a single 45-minute essay 
assessment with two independent readings, the predictive 
correlation increased to .77. Two essays alone, when used to 
predict scores on a writing performance assessment consist- 
ing of four essays, yielded predictive correlations in a range 
from .61 to .75. It seems clear from these analyses that good 
predictions of writing performance can be made using mul- 
tiple-choice tests alone, essays alone, or combinations of 
essays and multiple-choice tests. When essays are used 
alone, however, it is preferable to use more than a single 
essay and more than a single reader of each essay. 
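
The advantage of using more than one essay can be sketched with the Spearman-Brown formula, which projects the reliability of a lengthened measure from the reliability of a single unit. This is a simplification: it treats added essays as parallel measures and does not separate reader error from task error, as a generalizability analysis would. The single-essay reliability used here (`single_essay`) is hypothetical and is not taken from the studies cited.

```python
def spearman_brown(single_reliability: float, k: float) -> float:
    """Projected reliability when a measure is lengthened by a factor of k."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)

single_essay = 0.40  # ASSUMED reliability of one essay; not from the source

for n_essays in (1, 2, 3, 5):
    projected = spearman_brown(single_essay, n_essays)
    print(f"{n_essays} essay(s): projected reliability = {projected:.2f}")
```

The same function can be applied to the number of independent readings of one essay, given an assumed single-reading reliability.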

Fairness. In writing skill assessment, fairness issues 
have tended to focus on differences in gender, race, and 
language. There is much evidence to suggest that women 
tend to write better than men, on average, and that this 
advantage is more pronounced in free-response assessments 
than it is in multiple-choice assessments. Members of racial 
minority groups, as well as linguistic minorities, tend to score 
lower than non-minorities on all types of writing skill assess- 
ments (see, e.g., Breland & Griswold, 1982; Klein, 1989; 
Murphy, 1982; Petersen & Livingston, 1982). A special prob- 
lem encountered by linguistic minorities occurs when writing 
skill assessments consume a large proportion of the testing 
time of a more comprehensive assessment. 

Contrary to popular opinion, however, performance 
assessments do not necessarily result in better outcomes 
for minorities than multiple-choice assessments. Bond 
(1995) cites a number of papers suggesting that, for NAEP, 
extended-response essays resulted in mean differences 
between African Americans and Whites that were equal to 
those for the multiple-choice reading assessment. After cor- 
recting for unreliability, the mean differences actually 
exceeded those found on the multiple-choice reading assess- 
ment. Klein (1989) showed that increased essay testing on 
the California bar exam did not reduce the differences in 
passing rates between White and minority groups. 
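
The effect of correcting for unreliability can be sketched with a common correction for a standardized mean difference: the observed difference is divided by the square root of the reliability of the measure. The source does not say which correction was applied to the NAEP data, so the choice of formula is an assumption, and all values below are hypothetical.

```python
import math

def corrected_difference(observed_d: float, reliability: float) -> float:
    """Standardized mean difference corrected for measurement unreliability."""
    return observed_d / math.sqrt(reliability)

observed_d = 0.80  # ASSUMED: the same observed group difference on both measures
essay_rel = 0.50   # ASSUMED reliability of the essay assessment
mc_rel = 0.90      # ASSUMED reliability of the multiple-choice assessment

essay_corrected = corrected_difference(observed_d, essay_rel)
mc_corrected = corrected_difference(observed_d, mc_rel)
print(f"essay: {essay_corrected:.2f}, multiple choice: {mc_corrected:.2f}")
```

Because the less reliable measure receives the larger correction, equal observed differences can become unequal after correction, with the larger difference on the essay.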

Women have been shown to perform better on essay 
examinations than would be expected from their scores on 
multiple-choice tests of writing skill (e.g., Breland and 
Griswold, 1982), and in the Klein (1989) bar exam study, the 
passing rate for women increased when the amount of essay 
testing was increased. 



Consequences. Messick (1995) observes that it is
important to collect evidence of both positive and negative 
consequences of performance assessments. If the promised 
benefits to teaching and learning occur, then this is evi- 
dence in support of the validity of performance assess- 
ments. If such benefits do not occur, or if there are negative 
consequences, that is also important to document. If nega- 
tive consequences result, it is important to determine their 
causes. If some examinees receive low scores because some- 
thing is missing from the assessment, this is evidence of 
construct under-representation. That is, for example, if a 
writing assessment consists only of questions about how to 
revise text, and does not allow an examinee to demonstrate 
an ability to produce text, then the construct as defined by 
the Hayes and Flower model of writing is underrepresented. 
Additionally, low scores should not occur because the 
assessment contains irrelevant questions. In writing 
assessment, for example, an essay prompt on a topic that 
examinees are unlikely to be familiar with could affect 
performance unfairly. 

Unfortunately, evidence on the consequences of new 
forms of assessment is rarely assembled. In the health 
professions, for example, Swanson, Norman, and Linn 
(1995) could find only two examples of systematic research 
on the impact of changes in examinations. One reason for 
the failure to conduct research on consequences is that it is 
difficult (Linn, 1994). It may require a number of years for 
a new assessment to produce observable changes in the 
behaviors of students or teachers, or the changes may be so 
gradual that they are not easily detected. 

Comparability. In order to make comparisons of 
assessments from year to year or from administration to 
administration, the assessments must mean the same thing 
on different occasions. This means that they must be of 
comparable content and of comparable difficulty. In writing 
skill assessment, comparability is a particularly trouble- 
some problem because individual free-response tasks are 
quite often not comparable. Comparability problems are 
alleviated to some extent through the use of multiple tasks, 
as in NAEP and some statewide assessments, or through 
the combination of free-response tasks with multiple-choice 
items, as is done for SAT II, Advanced Placement, the GED, 
and Praxis. 

To ensure comparable content requires careful atten- 
tion to test specifications (Green, 1995). With traditional 
multiple-choice tests, test specifications are made comparable
across testing occasions by balancing the number of
items of each type. With a large number of items, balancing 
test specifications is not difficult. In writing skill assessment 
using free responses, however, the number of tasks is usually 
quite small, and each task may require from 20 minutes to 
an hour of time. As a result, comparability of content may be 
difficult to maintain. It is of course essential, in addition, to 
control the exposure of free-response tasks so that the con- 
tent does not become known prior to a test administration 
(Mehrens, 1992). The scoring of free-response tasks must be 
carefully controlled across administrations by use of the 
same scoring rubrics and reader training from year to year or 
from administration to administration. 

Finally, comparability is also affected by the reliability of 
a test, since the less reliable the test the less reliable will be 
equating across forms and occasions (Green, 1995). Some 
free-response writing tasks do not appeal to some examinees, 
and the resulting examinee-task interaction tends to lower 
reliability. A number of studies have demonstrated that 
examinee-task interactions are a major source of error in 
essay examinations (Breland et al., 1987; Brennan & 
Johnson, 1995; Coffman, 1966). 



It is important to note that, despite the considerable 
difficulties resulting from comparability problems in indi- 
vidual performance assessment, aggregate assessments such 
as NAEP, statewide, districtwide, and schoolwide assess- 
ments can often be made comparable through careful sam- 
pling designs and the rotation of free-response prompts 
(Green, 1995; Linn, Baker, & Dunbar, 1991). 

Cognitive Complexity. Proponents of performance testing
often note the need for “higher-level” assessments or for “ill- 
structured problems” (e.g., Frederiksen, 1984; Resnick & 
Resnick, 1990). Unfortunately, not all examinees can solve 
such difficult problems and, as a result, cannot be accurately 
evaluated by them. An example in writing skill assessment is
a prompt that consists of a quotation of some type. The quotation
may be a famous one such as Descartes’ “I think, therefore I 
am.” The examinee’s task is to write a well-organized and 
unified essay in response to a question about such a quota- 
tion. In the MCAT, for example, the first task is to explain 
what the quotation means. If the examinee does not know 
what the quotation means, which seems quite likely, then it 
is difficult to write anything at all. That is, the requirement 
of cognitive complexity interacts with the consequences 
discussed earlier. Messick (1994b) observed that “low scores
should not occur because the measurement contains some- 
thing irrelevant that interferes with the affected persons’ 
demonstration of competence.” In standardized testing, it 
has long been recognized that a range of difficulties in 
questions is needed to obtain good assessments for all 
examinees. 



Realism. It is often assumed that an essay examination 
is unquestionably a realistic representation of a real-life 
task. A brief 20- to 45-minute essay, in response to a topic 
previously unknown to the examinee, is hardly realistic. 
Truly realistic assessments of writing skill would require 
several samples of writing produced without severe time 
constraints and evaluated by multiple judges. In most 
writing assessments the time available is limited, and 
administrative and scoring costs must be controlled. Most 
writing skill tests, therefore, can only be simulations of the 
skill being assessed. From this perspective, a brief essay, 
even if on an impromptu topic, is a useful simulation of 
writing. Likewise, a brief editing task is also a useful simu- 
lation of a real-life task even if the writing to be edited was 
written by someone other than the examinee. 

Cost and Efficiency. Writing portfolios and multiple- 
choice tests of writing represent two extremes on a cost/ 
efficiency continuum. There can be little doubt that a care- 
fully designed writing portfolio is a reliable and valid as- 
sessment of writing skill. Moreover, such a portfolio should 
have a positive impact on the curriculum. Nevertheless, a 
writing portfolio would not be the optimum assessment for 
all purposes because of its cost and because of the amount 
of time required to provide feedback to the examinee or to 
the educational system. As used in NAEP for a representa- 
tive national sampling of students, a writing portfolio 
seems appropriate, as it can be for some statewide assess- 
ments. For individual assessment, as in college and gradu- 
ate school admissions, a writing portfolio would be exces- 
sively expensive. The time required to develop and report 
scores would not be compatible with timing requirements in 
the admissions cycle of events. 



At the other extreme of this cost/efficiency continuum, 
the multiple-choice test of the conventions of standard 
written English seems appropriate for some purposes but 
not for others. If the test has been demonstrated to be both 
reliable and valid for the purpose intended (e.g., the predic- 
tion of writing performance in college or graduate school), 
then it would seem to be a likely candidate for use in those
situations. Of course, all considerations described above, 
including fairness and consequences, need to be evaluated as 
well. It may be that the addition of a brief essay to the mul- 
tiple-choice test will enhance its validity or fairness and have 
positive impact on the educational system. 

Between the extremes of the writing portfolio and the 
multiple-choice test are other options, including the use of 
multiple brief writing tasks as in the MCAT and the GMAT. 
Here, the disadvantages of relatively low reliability and
incomplete content coverage (with no revision or editing) may
be offset by considerations of fairness and consequences,
and cost and efficiency, though not optimum, are acceptable.



Conclusion 

The performance testing movement has had a positive 
impact on writing skill assessment practices. Essays and 
other free-response tasks are now being used much more 
widely, and their use helps to make writing assessments 
more representative of real-world writing by helping to cover 
more of the domain involved in actual writing. But it must be 
remembered that the writing tasks used in assessments, for 
the most part, can only be simulations of real-world writing. 
For that reason, free-response writing tasks may be no more 
authentic than any other kind of writing assessment. Some 
writing portfolios may closely approximate real-life tasks, but 
such portfolios are rare, and they are not suitable for all 
assessment purposes. 

Another thing to remember is that there is also a posi- 
tive side to the decomposition and decontextualization of 
writing skills. Most decomposed and decontextualized skills 
are much easier to teach and learn than writing itself. And 
these skills are important in their own right, even if they do 
not make a person a good writer. Take, for example, simple 
writing problems such as vague pronoun references, parallel 
structure in sentences, and transitions from one sentence to 
another. It is relatively easy to teach and learn about these 
simple kinds of writing problems, and they can be quite 
important in a person’s life (say, in writing a letter of applica- 
tion for a job). Recognizing such writing problems in the 
writing of others is no different than recognizing them in 
one’s own writing. Many other examples could be cited of 
decomposed and decontextualized writing skills that are 
important. 

Finally, we need to remember that evidence has yet to
be assembled to show that free-response writing skill assess- 
ment will improve the writing skill of the nation or that 
other kinds of writing skill assessment have diminished 
these skills. Perhaps the 1998 NAEP writing assessment, 
when compared to the 1992 NAEP writing assessment, will 
show an improvement. Until that occurs, or until other 
evidence shows that the nation’s writing has improved, it 
seems best to be cautious about radical changes in our 
writing skill assessments. 

The types of writing skill assessment that are most 
appropriate depend upon the purpose of the assessment. If 
the purpose is to gauge trends in national abilities, as in 
NAEP, or trends in states and districts, free-response writ- 
ing assessments clearly seem to be appropriate. Reliability 
is much less of a problem because of the aggregation of data 
at the national, state, or district level. For purposes of high- 
stakes individual assessment, as for admissions decisions 
for college or graduate school, reliability problems are 
formidable, however, and caution needs to be exercised. 

One approach to handling the reliability problem of free- 
response tasks is to combine both free-response and mul- 
tiple-choice tasks to make up the assessment, as is done for 
the SAT II writing assessment, Advanced Placement, 
Praxis, and the GED assessment. Another approach is to 
use multiple free-response tasks, as is done for MCAT and 
GMAT examinations, and which is being considered for the 
GRE writing assessment. Cost and other practicalities will 
usually limit the number of writing tasks used to two or 
three. The effects on reliability and validity of such limita- 
tions need to be examined. 

Long experience with performance-based assessments 
in the health professions supports a blend of assessment 
methods (Swanson, Norman, and Linn, 1995). A similar
conclusion was reached by Ackerman & Smith (1988) 
through a factor analysis of both essay tests and
multiple-choice tests of revision skills. Miller & Crocker (1990),
in their review of validation methods for writing assessment,
came to the same conclusion: 

“Thus, it seems that when interested in provid- 
ing a complete description of writing ability, 
both direct and indirect writing assessment are 
needed” [p. 292]. 

From a domain coverage perspective, the use of combined
methods seems more likely to cover both drafting and revi- 
sion skills. 

What are the prospects for the future? Technology seems
to be an important key to the future of writing skill assessment.
Already, for Praxis and other assessments, writing
samples are collected by computer and thus are available for
analysis, transmission, and evaluation using computer-based
technologies. The inefficiencies of collecting writing samples
on paper and having them evaluated by experts should be
much less of a problem in the future. Samples can be transmitted
to experts electronically for evaluation in their homes
or offices without the need for travel to a central facility for
scoring. Thus the future of writing skill assessment would
appear to be one of increasing acceptability of performance
tasks, even portfolios, because of the efficiencies that will be
available through technology. Multiple-choice testing will
still be useful, however, though “bubbles” on answer sheets 
will probably be replaced by mouse clicks on the computer, as 
they have been on the College Board’s Computerized Place- 
ment Tests. It is quite likely that many editing tasks will be 
conducted by constructing responses rather than clicking on 
an answer. The computer should also make it possible to 
provide more diagnostic feedback to students and teachers 
than is currently possible. 